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ACCELERATED DIMENSION-INDEPENDENT ADAPTIVE 

METROPOLIS 


YUXIN CHEN*, DAVID KEYES^, KODY J.H. LAW*, AND HATEM LTAIEF§ 

Abstract. This work considers black-box Bayesian inference over high-dimensional parame¬ 
ter spaces. The well-known adaptive Metropolis (AM) algorithm m is extended herein to scale 
asymptotically uniformly with respect to the underlying parameter dimension for Gaussian targets, 
by respecting the variance of the target. The resulting algorithm, referred to as the dimension- 
independent adaptive Metropolis (DIAM) algorithm, also shows improved performance with respect 
to adaptive Metropolis on non-Gaussian targets. This algorithm is further improved, and the pos¬ 
sibility of probing high-dimensional targets is enabled, via GPU-accelerated numerical libraries and 
periodically synchronized concurrent chains (justified a posteriori). Asymptotically in dimension, this 
GPU implementation exhibits a factor of four improvement versus a competitive CPU-based Intel 
MKL parallel version alone. Strong scaling to concurrent chains is exhibited, through a combination 
of longer time per sample batch (weak scaling) and yet fewer necessary samples to convergence. The 
algorithm performance is illustrated on several Gaussian and non-Gaussian target examples, in which 
the dimension may be in excess of one thousand. 

Key words. Markov chain Monte Carlo, big data, Bayesian inference, adaptive Metropolis, 
Metropolis-Hastings, BLAS, GPU-acceleration, High performance computing. 


1. Introduction. Recent years have seen increasing activity in the areas of un¬ 
certainty quantification and big data, largely enabled by the progress of computational 
science, which itself is enabled by ever more powerful computers and the symbiosis of 
this architectural brute force with innovative algorithmic advances. In particular, the 
solution of a forward problem, given by an ordinary differential equation (ODE) or 
partial differential equation (PDE), may be viewed as a distributed quantity induced 
by the uncertainty of input parameters |46j . rather than as a deterministic quantity. 
When the input parameters themselves are spatially (and/or temporally) extended, 
one is faced with much higher-dimensional problems, and indeed distributions over 
function spaces in principle O [501 ESI • In the context of Bayesian inference, this leads 
to the notion of a Bayesian analogue of the classical inverse problem [ziiisiiiniiH]. 
Such problems are enormously challenging both algorithmically and computationally, 
and largely motivate the present work. At the same time, a very similar problem 
of big data is recently attracting a lot of attention. In the former case, even in the 
hypothetical case of full-field measurements, when the amount of data is infinite, the 
effective dimension of the data, or the space where posterior measure concentrates 
with respect to the prior, is often quite small with respect to that of the underlying 
parameter of interest, due to smoothing of the forward problem [TldlllillM]. The 
big data problem directly confronts the case of genuinely high-dimensional posterior 
distributions, i.e., the posterior differs significantly from the prior in the whole space 

[7011111551EHIIIH]- 

1.1. Algorithmic introduction. Probability distributions over low-dimensional 
spaces are straightforward to represent via the associated probability density. It is 
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impossible, however, to represent densities in higher than a few dimensions. But one 
can do something that is usually sufficient in scientific utility: one can sample the 
probability distribution with Monte Carlo. Probability distributions arising from a 
Bayesian framework introduce another layer of complexity in Monte Carlo, as typi¬ 
cally one can only evaluate the posterior distribution, up to a normalizing constant, 
while direct sampling methods typically do not exist. One must resort to methods 
such as importance or rejection sampling or Markov chain Monte Carlo (MCMC) 

[301 US]. 

A primary workhorse of Bayesian computation is MCMC. A popular and versatile 
MCMC algorithm is the Metropolis-Bastings algorithm (MB), introduced in jSSj and 
later revised to its current form in [3S]. The adaptive Metropolis algorithm (AM) 
[38] . and derivatives thereof (DRAM (32], ASWAM [4], SCAM [34], RAM [77], etc.), 
construct proposals based on the empirical covariance arising from the current trajec¬ 
tory, i.e., the past samples. These proposals are perhaps the most versatile, effective, 
and useful among the MB-type algorithms for low-dimensional and reasonably well- 
behaved targets, for example unimodal up to a dimension of 100. As the proposal 
depends on the chain history, it is no longer Markov, although there is theoretical work 
guaranteeing convergence under fairly general conditions |3l |62l [681 ESI ES] • For tar¬ 
gets in which the Bessian of logarithm has a strong local dependence, gradient-based 
proposals such as the Metropolis-Adjusted Langevin algorithm (MALA) [ST] US] or 
the Bamiltonian Monte Carlo (BMC) algorithm |S0] or their manifold extensions 
m can improve the convergence time, at the cost of providing the gradients, which 
may be nontrivial to obtain or may not even exist. It can be shown that such propos¬ 
als, as well as the random walk (RW) proposal upon which the AM algorithms are 
based, can be derived from the explicit discretization of a certain stochastic differential 
equation (SDE). Based on such diffusion limits, it has been shown that for underlying 
dimension d, the variance, or squared step-size, taken by random walk Metropolis algo¬ 
rithm (RW), MALA, and BMC algorithms must scale as 0{l/d) [64l|6l|5T], 0{d~^^^) 
[65l|60],andC>(d-i/‘‘) 0, respectively. This naturally translates to decorrelation time 
of the inverse order, i.e., the number of steps required to obtain an almost indepen¬ 
dent sample is 0{d), 0{d^^^), and 0{d^^^) [50]. For high dimensional targets, this is 
naturally impractical, and this has been a limiting factor for the application of these 
algorithms to targets over higher dimensional spaces, although the gradient-based 
methods can still be effective in high dimensions if Bessian information is incorporated 
efficiently Hillll. If a target arising from a Bayesian inverse problem is well-defined 
in the function-space limit, as it should be, then proposals can be designed to respect 
that limit m- When the problem is discretized, such proposals exhibit a decorre¬ 
lation time that is independent of the refinement of the mesh towards that limit; 
in other words, independent of the underlying dimension mm, or 0{1). Recently 
the work |45[ introduced an algorithm that incorporates general operator-weighting, 
and in particular Bessian information, into function-space proposals which may be 
derived from time-inhomogeneous discretization of the Ornstein-Uhlenbeck SDE. The 
work m goes one step further, using prior-preconditioned Bessian information to 
adaptively identify the space of posterior concentration, and then using empirical co- 
variance information within that low-dimensional space to adaptively precondition a 
time-inhomogeneous discretization of the Langevin SDE. 

In general, the amount of elaborate forward simulation code in the world, whether 
it be high-dimensional ODE or PDE, far outweighs the associated gradient and adjoint 
codes, so often such information may not be available. Indeed the possibility of 
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avoiding the person-hours required to construct such code is therefore highly valuable, 
and provides good motivation for constructing non-intrusive, black-box^ or gradient- 
free algorithms. This work presents an alternative approach to those described above, 
in an attempt to combine the best of the worlds above without resorting to gradient 
information. Indeed, the pre-conditioned Crank-Nicolson (pCN) proposal of [13] arises 
from a Crank-Nicolson discretization of an Ornstein-Uhlenbeck SDE which preserves 
a certain Gaussian measure. In contrast, the RW proposal arises from an Euler- 
Maruyama discretization of a diffusion which spreads mass to infinity and has no 
invariant measure. It is this property that provides the 0{1) decorrelation time of 
the former versus the 0{d) of the later. From this viewpoint, the advantage of the 
former is clear even in the absence of a function-space limit. Herein we construct 
a proposal inspired by the pCN that preserves a distribution proportional to the 
empirical Gaussian obtained from past samples, yielding an asymptotically dimension- 
independent adaptive Metropolis algorithm, which will be abbreviated DIAM. That 
is, the decorrelation time is expected to scale as 0{1) for reasonably well-behaved 
distributions, and this can be proven for the Gaussian case. Nonetheless, this will 
result in a gain of only 0{d}^^) in convergence time for root mean squared error 
(RMSE) quantities. Therefore, the value is still limited as long as one is limited to 
d < 100. On the other hand, when the dimension of the target becomes much larger, 
the cost of adaptation itself may become a limiting factor due to the required linear 
algebra. The computational contribution consists of mitigating this effect. 

1.2. Computational introduction. From the computational perspective, the 
fundamental limiting operations that comprise the AM algorithm, and the dimension- 
independent adaptive Metropolis algorithm (DIAM) extension proposed here, are 
Level 2 and 3 Basic Linear Algebra Subprograms (BLAS) operations, scaling tra¬ 
ditionally as 0{d’^) and 0{d^), in particular, dense matrix-vector, matrix-matrix mul¬ 
tiplication, and Cholesky-based matrix inversion. These operations prevent its use in 
high dimensions, even given the algorithmic advances outlined in the previous sec¬ 
tion. However, it is shown here that one may impose a lag-time of 0{d) between 
Cholesky-based matrix inversion, and hence block updates of the covariance, without 
increasing the required number of samples to convergence. The algorithm is thereby 
immediately reduced to 0{d?) rather than 0{d^), in the sense that the cost to obtain 
N samples is 0{Nd^) (assuming the cost of evaluating the logarithm of the unnormal¬ 
ized density is at most 0{d^)). It is also feasible to reduce the cost of the algorithm 
to 0{d‘^) by using low-rank Cholesky updates [JO] [77]. It is proposed here to use 
state-of-the-art GPU acceleration of dense linear algebra operations within the fun¬ 
damental operations of the AM and DIAM algorithms. Compute-bound operations, 
i.e.. Level 3 BLAS kernels, usually benefit the most from these hardware accelerators 
because they are able to stress the floating-point units with significant data reuse 
at the high level of the memory hierarchy, and they attain a decent percentage of 
the theoretical peak performance of the underlying hardware. Memory-bound opera¬ 
tions, i.e.. Level 2 BLAS kernels, are however limited by the bus bandwidth and how 
fast the requested data can be fetched to the floating-point units, due to negligible 
data reuse. Accelerators provide much higher bandwidth compared to standard x86 
architecture and, therefore, memory-bound kernels can still be accelerated on such 
hardware. All these assume that the data resides already on the GPU memory, which 
is not always the case for current architecture model. Data has to be offloaded from 
the host (CPU) memory to the device (GPU) memory through a thin pipe called the 
Peripheral Component Interconnect Express (PCIe), which has an order magnitude 
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lower bandwidth than the GPU. It is illustrated that by distributing the Level 2 BLAS 
operations across the GPU, the quadratic scaling is reduced by a factor of almost 4, 
by a combination of the slow data transfer through PCIe, mitigated by asynchronous 
processing, and the speed-up of the resultant Level 2 BLAS operations owing to the 
increased memory bandwidth on the GPU. 

The clock frequency of a single processor of GMOS logic has nearly reached its 
physical limit due to power dissipation constraints. The multicore era has permitted 
the introduction of multiple low-frequency cores on a single chip. This trend has 
been reinforced moving forward with the international exascale roadmap |19] . where 
streaming multiprocessor architectures (NVIDIA GPUs, Intel Xeon Phi, etc.) com¬ 
posed of lightweight cores will be the norm for future exascale systems. The value of 
brute force concurrent (embarrassing) parallelization is therefore seeing an increase 
in value. While traditional Monte Garlo methods enjoy this property, Markov chain 
Monte Garlo methods do not, as they are inherently serial in nature. Nonetheless, one 
can a posteriori justify the merging of concurrent parallel chains within the framework 
of [501110], using the so-called potential scale reduction factor (PSRF) as a diagnostic 
to measure convergence. This is the approach to parallelization of AM taken in the 
recent works nsiin], although neither work confronts a high dimensional parameter. 
In [15] the objective is to sufficiently explore the state-space in order to identify a 
partition for regional adaptation. In this approach is used to mitigate the cost 
of very expensive forward solves. Herein, the approach is proposed as a general par¬ 
allelization strategy for the algorithm, indeed with almost perfect scaling efficiency 
in terms of time. The convergence time of the empirical covariance is decreased by 
concatenating samples from the concurrent chains through periodic synchronization. 
This gain makes up for the slight slow-down in the collection of a given batch of 
samples, resulting in effectively strong scaling with respect to convergence time. It is 
shown that this allows black-box sampling of targets over very high dimensions. As 
the focus of this work is the new DIAM algorithm, the principle is illustrated for that 
algorithm, but the same principle is expected to apply to AM. 

It should be noted that many more elaborate approaches to parallelization of 
Bayesian computation have recently emerged, including [71 [HOI mi HI izi iMi mi- 
For example, the authors in m and m developed a GUDA kernel to tackle the 
most time-consuming phase of their MGMG simulation using SIMD parallelizations 
to run on the massive number of GUDA cores available on the GPU card. Our 
numerical algorithm relies on BLAS operations, for which most vendors provide highly 
optimized implementations on their hardware (e.g. cuBLAS for NVIDIA). Moreover, 
our implementation is portable across a range of vendor hardware, thanks to the 
legacy of the BLAS library. 

It should also be noted that more advanced Monte Garlo methods exist for 
Bayesian computation, such as population-based MGMG [HIST], equi-energy sam¬ 
plers [13], and sequential Monte Carlo samplers m- Such methods are indeed neces¬ 
sary for sampling from very complex multi-modal distributions, but it should be noted 
that Metropolis-Hastings algorithms appear within these algorithms as a fundamental 
component, similarly to the way the BLAS operations appear in the MH algorithms 
as a fundamental component. The proposed DIAM algorithm is therefore expected 
to have a great impact as a fundamental black-box MH algorithm. 

The rest of this paper is organized as follows. In Section[^the problem of Bayesian 
inference in high dimensions is introduced precisely, detailed definitions of the base¬ 
line and benchmark algorithms are given, and finally the concurrent formulation is 
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presented as well as the convergence diagnostic for its a posteriori justification. In 
Section]^ the algorithms are illustrated by some numerical experiments. In Section]^ 
advanced GPU acceleration techniques are introduced, as well as the logistical frame¬ 
work for extending to multiple chains. Performance results are highlighted in Section 
[^and we conclude in Section]^ 

2. Bayesian inference in high dimensions. 

2.1. General problem formulation. The problem considered here is the fol¬ 
lowing. Given a quantity of interest M, estimate its expectation with respect 

to a probability measure tt 


7r(vj) := 



(p{x)TT{x)dx 




X^ 


TT. 


( 2 . 1 ) 


The notation “x ~ tt” indicates that the random variable x follows the distribution 
of TT. The convergence of the approximation given above is a consequence of the Law 
of large numbers for independent identically distributed (i.i.d.) random variables Xi 
EH, and an extension thereof under an assumption of sufficient decay of correlation 

[51]. 

Let ry : — >• K+, where 1R+ = {t G M.;t > 0}, and assume Z := J^ir]{x)dx < 

oo. Then tt = t]/Z is a probability density, in the sense that tt : — ?> K_|_ and 

Tr(x)dx = 1. Assume that given x € ri{x) can be readily evaluated, but that 
there is no direct method for sampling from tt. Probability measures in the present 
work will always have densities with respect to Lebesgue measure, and the same 
notation will be used both for the measure tt : cr(K‘^) —)■ [0,1], where (t(IR‘^) refers 
to the sigma algebra of measurable sets in and its density tt : —)■ IR+ with 

Jgd Tr{x)dx = 1. This should not cause confusion. 

Such a problem often arises in a Bayesian context, in which case one has some 
observation y such that y\x ~ L(x,-), where L{x,-) is the likelihood which gives the 
distribution of the data y conditional on x, and one knows how to evaluate the density 
L{x,y) point-wise. The density of the posterior distribution of x\y is given by 

tt{x) = ^L{x-,y)TTo{x), Z=l L{x-,y)TTo{x)dx, (2.2) 

^ Jr‘‘ 


where ttq is the prior distribution of x before any observation is made, L{x;y) is the 
density associated to the law of y\x, and the notation is used to emphasize that 
the observation y G is fixed to a given observed value, while x is allowed to vary 

m- 


Particular attention will be paid to the case in which d is large. For example, in 
the context of Bayesian inverse problems, d —> oo in principle and it is appropriate to 
formulate the problem as the discretization of a limiting measure on a function-space 
X. In this case the target is a measure /i : X 
form 




y.{X) = 1, and (2.2) takes the 


dy 

dyo 


(x) 


1 


Hx;y), 


Z = 



Hx;y)po{dx), 


(2.3) 


where dy/dgiQ denotes the Radon-Nikodym derivative of y with respect to i.e., 
the ratio yidu)/y^idu) of infinitesimal volume elements at the point u. A sufficient 
requirement for the above to be well-defined is that < iio{L{-;y)) < c for some 
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c G ( 0 , 00 ) [73]. This context will not be considered further, however this is the 
problem to have in mind when we refer to the d —>■ 00 limit for Bayesian inverse 
problems. 

The case of big data may also come increasingly to fit into this scenario. While 
it has come to refer in the statistics community to the case of large dy |42l [55] . 
which need not imply large d, it would be natural to try to explain high-dimensional 
data in terms of a high-dimensional parameter. This may again lead to a posterior 
distribution over a high-dimensional space. For example, in the context of regression, 
access to an increasing number of observations and potential covariates may inspire 
one to consider an increasing number of covariates as well as an increasing number of 
observations. In the Bayesian inverse problem context, the data may often be given 
as a noisy observation of the solution of a PDF with the parameter as input, and the 
intrinsic smoothing property which provides well-posedness of PDF may hence reduce 
the effective dimension of the data even in the case of full-held measurements when 
dj, —>■ 00 . In the big-data context, on the other hand, the data may be genuinely 
informative over increasingly high-dimensional parameter spaces which can lead to 
higher effective dimension of the posterior with respect to the prior in comparison 
with the Bayesian inverse problem, albeit with a generally much simpler forward 
model connecting the parameter to the observations. The general black-box methods 
developed here are expected to be effective in both cases and more. 

2.2. Markov chain Monte Carlo. Introduce a Markov chain with transition 
kernel /C : x K’*'. Let denote the set of probability densities 

over i.e., functions p :—?► —>■ IR+ such that J^j,p{x)dx = I. By the defini¬ 

tion of Markov kernel, for q G V, one has that p{y) = q{x)IC{x,y)dx G V. The 
following short-hand notation is therefore commonly used p = qlC, while the equa¬ 
tion p{(p) = /jgd /jjd q{x)lC{x,y)dx(p{y)dy inspires the analogous notation / = ICp = 

IC{x,y)(p{y)dy, so that p(<p) = {qIC){(p) = qiJCp) = ?(/). The unfamiliar reader 
can think of the discrete state-space analogy of row vectors representing probability 
distributions, column vectors representing quantities of interest, and the transition 
kernel given by a row stochastic matrix. A density tt such that tt = /Ctt is referred to 
as (the density of) an invariant measure, and a sufficient condition is reversibility 

f Tr{dx)IC{x,dx') = f Tr{dx')IC{x',dx). (2.4) 

Under additional assumptions of irreducibility and aperiodicity, one has ergodicity of 
the chain, i.e., limjv_>.oo {xq, ■)—tt\tv = 0 for any xq G and rates can be derived 
depending essentially on the rate of decorrelation of the chain. A consequence of this 
is that if one sets Xn /C"(a;o, •) = IC{Xn-i, •)) then a;„ is distributed approximately 
according to the target tt, hence such {Xn-M}n=M+i he used in the approximation 

([ 23 . 

Indeed if xm ~ tt and the autocorrelation function (ACF) := ¥.[xm+n — 
E(x)][a;m — E(x)]/(E[a; — E(a::)]^) = p" for some p G (0,1), then a simple calcula¬ 
tion shows that 




I 

N 


iV+M 


7r(p) 




E^(x)]2(1 + 2/(1-p)), (2.5) 


where the geometric series identity 0 = P” = f/(l ~ P) used to simplify 

the integrated autocorrelation time (IACT) I -|- 20. Notice that by comparison to the 
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celebrated Central Limit Theorem [61] for i.i.d. draws, the effective sample size of 
the correlated ensemble, with respect to the i.i.d. case, may be defined as N^g = 
N/{1 + 2Q). 

The Metropolis-Hastings (MH) algorithm, introduced in [5^ and refined to its 
present version in [35] , is perhaps the most popular and versatile amongst the MCMC 
methods. It states that an essentially arbitrarily chosen transition kernel Q [76] can 
be composed with an accept/reject step as follows in order to satisfy reversibility (2.4| 
with respect to tt. Given the next sample Xn+i ^ fC{xn, ■), where the kernel 1C is 
defined as follows 

• Let x' Q{xn,-), 

• Let 


^n+1 


x' w.p. min{l, a(x„, x')} 
Xn else. 


where the acceptance probability a is 

a{Xn,x') 


defined as 

_ 'n'ix')Q{x',Xn) 

TriXn)Q{Xn,x') ' 


( 2 . 6 ) 


(2.7) 


There are clearly infinitely many possible choices of Q, which leads to a wide range of 
behaviors of the associated kernels JC. Essentially one aims to minimize the correlation 
between the subsequent samples, which in turn results in a smaller p in (2.5) above 
and hence smaller 0 and larger effective sample size Tbe Metropolis-Hastings 
algorithm is ubiquitous, not only as a method in its own right, but also as a funda¬ 
mental component for many other Bayesian computation algorithms, as mentioned at 
the end of Section [TJ 


2.3. Advanced Metropolis-Hastings proposals. This subsection will focus 
on the MH algorithm introduced in the previous subsection. The most basic Metropolis- 
Hastings proposal will be introduced (indeed, the Metropolis algorithm), followed by 
the more advanced black-box, or gradient-free, algorithms which were mentioned in 
Sec. [^ Finally, the algorithm introduced in the present work will be defined. 

2.3.1. Random Walk. The presentation begins with the SDE 


dx = AdW (2.8) 

where A S is positive definite and dW is an independent increment of Brownian 
motion dW ~ N{0,dt x I) [59]. An Euler-Maruyama discretization of this equation 
with step-size /3 (time-step /3^) gives [H] 


Xn+l = Xn+ PAWn, (2.9) 

where Wn ~ A(0, 1) and Wn -L Wm for all n, m. The standard random walk (RW) is 
defined by the above equation so that Q{xn,Xn+i) = Q{xn+i,Xn) oc exp{— 2 ^|A“^(a:„— 
Xn+i)P}- The fact that the proposal density is symmetric means that a{x,x') = 
7T{x')/Tr{x). Often A = / is chosen as the identity matrix, although it is possible to 
make other educated choices, for example the prior covariance in a Bayesian context, 
the Hessian close to the maximizer, or some other approximation of the covariance of 
the target. 
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2.3.2. Preconditioned Crank-Nicolson. In turn, the Ornstein-Uhlenbeck pro¬ 
cess is defined by the following SDE 

dx = -Bxdt + y/WAdW, (2.10) 

where A is as above, B is symmetric and positive definite, y/B denotes the symmetric 
matrix square root, and it is assumed that BA = AB. It can be shown that the above 
equation has invariant distribution A'^(0, making it a reasonable equation to 

aim to approximate if AA^ is a good approximation of the covariance of the target. 
It was proposed in mm to use the above SDE as a starting point with A = VC' and 
B — I for posterior measures with Gaussian prior IV(0, C), and furthermore to use a 
Crank-Nicolson discretization scheme, leading to the following update, for time-step 
<5 (upon multiplication by 2): 

{2 + 6)xn+i = {2-5)xn + 2V^AWn. ( 2 . 11 ) 

Setting step-size /? = 2^/^/{2 + 5) one has the pre-conditioned Crank-Nicolson (pCN) 
proposal [T3] 


^n+1 


= \/r^ 


■PAWn 


( 2 . 12 ) 


with Wn as above. Notice that this equation preserves the measure iV(0, AA^), just 
like its continuum counterpart (2.10). This means ifp is the density of A^(0, ^yl^) then 


P = pQ, which in turn implies p{x)Q{x, x') = p{x')Q{x',x). So, if tt{x) = q{x)p{x) 
for some g, then the MH algorithm with this proposal has the following acceptance 
probability a(x,x') = q(x')/q(x). This is useful in case the prior is Gaussian, as 
only the likelihood appears in the acceptance. There is nothing intrinsically finite¬ 


dimensional about (2.10), or its temporal discretization (2.12), so one can see how 


this allows the definition of a function-space algorithm, i.e., one which is defined in 


the limit d —>■ oo for targets of the form (2.3) in which /ig is Gaussian. Indeed as long 


as one can construct a proposal which is reversible with respect to the prior, then the 


same theory extends to non-Gaussian prior m- By observing that the form of ( |2.12[ ) 
may be extended with operators B replacing the scalar /3, the work of [45j introduced 
general operator-weighted proposals which are reversible with respect to priors of the 
form N{m, AA^): 


1+1 = m-\- A{I — BB^y^'^A ^(xn — m) ABWn- 


(2.13) 


For the above proposals, Hessian information may be incorporated if it is available, 
and this was the strategy of |45j . This was extended to more general proposals includ¬ 
ing also gradient information, and given the general name of dimension-independent 
likelihood-informed (DILI) proposals in |16j . The name derives from judicious incor¬ 
poration of the linear subspace where the posterior concentrates with respect to the 
prior, the likelihood-informed space (LIS) |17) . 

It has been shown in [M] that for proposals of the form (2.9) one must have 
/3^ = 0{l/d), thereby leading to a decorrelation-time of 0{d). In turn, by virtue of 
being defined in the function-space limit, the proposals described above allow /3 = 
0{\) with respect to parameter dimension. Of course, the effective data dimension, 
i.e., the dimension of the LIS, will indeed still play a role for the above proposals. 


although it can be mitigated for DILI proposals, in particular those of the type (2.13) 
by scaling the data-informed directions appropriately. 
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2.3.3. Adaptive Metropolis. When gradients are unavailable, as assumed in 
the present work, one way to improve upon the proposals (2.9) and (2.12) above is 
with empirical covariance information, and this leads to the adaptive Metropolis (AM) 
algorithm (331 • Let 


1 " 
Cri — ^ 

i=\ 
1 " 

TTlji ^ ^ 

71 




2=1 


(2.14) 

(2.15) 


and choose A„ such that AnA^ = Cn- Plugging this into (2.9) yields the classical 
adaptive Metropolis proposal. The work [64] identifies an optimal acceptance rate of 
0.234, and the later work [3] proposes to scale adaptively the step-size within the AM 
algorithm to target such acceptance ratio. This will be the version of AM considered 
here. 


2.3.4. Dimension-independent adaptive Metropolis. The new algorithm 
introduced here was already alluded to in nans]. This follows naturally from the 
above presentation by substituting an A„ such that = C„ into (2.12). In fact, a 

reference point should possibly also be taken into account, in which case the proposal 
takes the form 


Xn-\-\ — Xj-ef ^/l (^n ^ref) A (2.16) 

The reference point x^ef may be chosen as the maximum a posteriori (MAP) estimator, 
i.e., xmap = argmax,j,7r(x), if this is available. Or else, it may be adapted to the 
empirical mean. It is worth dwelling on several points that make this proposal, and 
the resultant MH algorithm, attractive: 

• This proposal asymptotically targets N{uref,Cac>), which is the best Gaus¬ 
sian approximation of the target in case Uref = —>■ rrioo, for example as 

measured by Kullback-Liebler (KL) distance |44j . 

• As /3 —1, the algorithm converges to the independence sampler. Hence, for a 
Gaussian target, it is easy to see that the acceptance probability approaches 
1 , following from the previous point. 

• In the non-Gaussian case, the variance of the proposals will asymptotically 
coincide with the variance of the target, for any /3. In turn, the variance of 
the proposals from the AM algorithm will be (1 -I- /3^)C'„. So in order for its 
trace, i.e., the expected norm of the AM proposals, to be on par with the 
target, one will necessarily need to choose = 0{l/d). 

• Following from the above, for a Gaussian target, the asymptotic decorrelation¬ 
time of the new algorithm is 0{1), as opposed 0{d) for the AM algorithm. 
The new algorithm will hence be called dimension-independent adaptive Metropo¬ 
lis (DIAM). 

For nonlinear/non-Gaussian targets, it will be necessary to modify the above with 
some additive inflation factor a > 1 as follows 

Xn-t-l — A \/1 {Xn X^ef) -f (2.17) 

^In this case, it should be set to zero until some sufficiently large n to avoid too many abrupt 
jumps of the pivot. 
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Notice that as long as a S [1, \/2], the proposal covariance is still smaller than in the 
AM case, and one therefore expects improved performance, although some dependence 
on dimension will then exist. This point requires further investigation. 

2.4. Concurrent chains. It is relevant to discuss the potential of “embarrass¬ 
ingly parallel” MCMC. This is a controversial topic, since MCMC is an intrinsically 
serial algorithm and convergence proofs typically rely on this fact. Nonetheless, the 
works [261 llOj describe a convergence diagnostic based on running multiple chains 
and comparing the between-chain and within-chain covariances. Once this diagnostic 
indicates convergence, one is then justified a posteriori to merge the samples from the 
different chains. 

2.4.1. Logistics. Denote P chains by and let k denote the number of 

batches which have been done. Each of these is run for M intervals of length niag and 
the local first two moments are collected periodically. 


mniag 


= 




^ (m - l)niag 
mniag 


k.m— 1 


^lag 

mniag 


mniag 


i: 

i=(m—l)niag-|-l 


1 

mniag . 


mniag 


E 

i={m—l)niag + l 

(m - l)niag p 


mniag 

mniag ^ 


^“^lag .ictg. 

After each niag update, local updates of the global moments are made 
oP,glob _ feAfi^niag ^glob 

“ IkMP + m)niag ^ 


rrV 


p,glob 


fcAfi^niag glob 


{kMP + m)niag 
followed by a local update of the global covariance 

^p,glob _ cP,glob _ p,glob/ p.globNT 


(2.18) 


(2.19) 


knnii^g p 

(fcMP-tm)niag 

(2.20) 

p 

{kMP + m)nug 

(2.21) 


( 2 . 22 ) 


This is used within the individual steps of the algorithm (2.16). Then, each time 
m = M, the local samples from the P chains are merged into global moments so they 
can be shared 


^glob 


glob 

= 


(fc-i) 


(fc-i) 

k 


1 ^ 

oglob . a p 

‘^k-1 fap 

p=l 

P 

(2.23) 

P=1 

(2.24) 


At this point, one can compute the global covariance once, or just return the moments 
to the individual chains to continue in parallel. This procedure can be optimized, but 
it is outside the scope of the present work. 















Accelerated DIAM 


11 


2.4.2. Potential scale reduction factor. As mentioned above, the potential 
scale reduction factor (PSRF) convergence diagnostic will be used for a posteriori 
justification of chain merging. It is defined as follows. Start P chains, with initial 
conditions which are over-dispersed with respect to the target. Define the following 
within-chain quantities for each p as follows 


qP _ qP 

k=l 


K 


= -y 


K 


m 


MK — / ,'^k.M-' 

k=l 


Now define the global quantities for * = 1,..., d: 

MKn\^g ^ 


B, = 


W,, = 




p_l ^y'^MK 

p=l 


)l 5 


MK Tiling 


{MKni^g - l)P 


JKci 




(2.25) 


(2.26) 

(2.27) 


where = *5'^^ — (TOMA:)(''^Mi<’)^- quantity is referred to as the 

between-chain variance, representing (a factor MKn\g_g times) the variance between 
the means computed in the individual chains. The second is the average within-chain 
variance across the chains, and is referred to as the within-chain variance. These 
quantities both approximate the variance. Now define 


MXniag - 1 / ^ + 1 \ B^ 

MKni^g ^ {pMKni^g) W,' 


(2.28) 


The PSRF in this direction is given by y/Ri- One expects that ^/Ri > 1 and clearly 
one has that y/Ri —>■ 1 as AT —>■ oo. The indicator for convergence is y/Rt — 1 < TOL, 
where TOL is taken to be some number smaller than 0.2. See [Ml HD] for further 
details. 


3. Numerical experiments. This section consists of a systematic collection of 
numerical experiments that present the algorithms defined in this paper. 

3.1. Description of the test cases. To begin with, several random posterior 
densities are introduced. First a standard normal random matrix A G is gen¬ 

erated, and used to construct a random symmetric matrix B = AA^. Such matrix 
has a spectrum with maximum eigenvalue 0{d) and minimum eigenvalue close to zero 
(r = d) or zero {r < d). To mimic the case of a posterior distribution, with standard 
normal prior and log-likelihood —the target is fixed as N{0,C), where the 
covariance is set to the form C = {B -\- I)~^. This covariance has smallest eigen¬ 
value 0{l/d) and largest close to 1, which will emphasize the effect of anisotropy. 
Furthermore, evaluating the target in these cases requires a dense matrix vector mul¬ 
tiplication which has a complexity 0{d?) and is thus greater than or equal to the 
cost of a typical black-box PDF forward solver one may encounter in a more realistic 
example. The following “twisting” function is introduced 

(j){x) = (xi, X2 + bix\,x^, XA -f 630 : 3 ,..., Xd/io + &d/io-ia^d/io-i> 2;d/io+i, ■ ■ ■, a;^). 
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which allows the construction of simple “boomerang” shaped targets with exactly 
computable moments. The following four Gaussian cases and two non-Gaussian cases 
are considered: 

. tti = N{0, C={B + /)-!), r = d (full-rank, cond(C') = 0{d)), 

• tt 2 = NiO,C= {B/d + I)-^), r = d, (full-rank, cond(C) = 0(1)), 

• TTs = iV(0, C = {B + r = d/10, (low-rank, cond(C') = 0{d)), 

• 7 r 4 = N{0,C = ydiag[(cr“^n“"^-b l)“^]„=i (full-rank, cond(C) = 

0(a-2)) 

• 7 r 5 = 7 rioyo(()o , bi = ba- /Vd, b = 0.3 (non-Gaussian, mildly twisted), 

• 7 r 6 = 7 rio\/o())o , bi = ba//"^/Vd, 6 = 2 (non-Gaussian, strongly twisted), 
where VT,V^ = C is the ordered eigendecomposition of C from tti, such that the 
first eigenpair corresponds to the smallest eigenvalue of C. Notice that the Jacobian 
determinant of </> is 1 , so a change of variables is trivial. Also, one can compute the 
maximizer of for all j and it is 0. Furthermore, the mean and variance of tts and 
TTe for i = 2,4, • • • , d /10 are given by 

E[{V^xU] = -6,_ia/_i, E[(FTa:), - E[{V^x),]]^ = af + 2bj_i4-i. 


where a/ are the variances of the component under tti o , i.e., the diagonal 
element in E. For the others, the mean is 0 and the covariance is C, of course. For the 


last two non-Gaussian distributions a > 1 must be tuned in (2.17), to allow sufficient 
spread in the proposal. This is purely heuristic. 

The above targets are all randomly generated, but chosen to mimic certain prob¬ 
lems that arise in practice. We fix a modestly high dimension (i=100. The target 
TTi has the structure one might encounter in a big data problem, where we reduce 
the dimension of the data to dy = d. This target is highly anisotropic because the 
covariance has a big a condition number, which may or may not be the case for a 
big data problem, but which makes the problem more challenging. The target 7 r 2 
is generated by deliberately reducing the condition number from 0{d) to 0 ( 1 ), thus 
making a clear comparison with tti to show how condition number impacts the algo¬ 
rithm efficiency. The target simulates the context of a Bayesian inverse problem, 
in which the posterior is low-rank with respect to the prior. The target 7 r 4 has the 
structure of a Bayesian inverse problem with “smoothing” forward map, for example 
from a PDF forward solve, given by the decaying spectrum of the likelihood. The 
parameter (t“^ in this case corresponds to 1/variance on the data. Smaller variance 
implies bigger condition number, which makes this distribution more anisotropic and 
thus harder to sample from. The targets tt^ and ttq are non-Gaussian distributions: 
TTs is a mildly twisted Gaussian and TTg is a strongly twisted Gaussian. 


3.2. Autocorrelation assessment. In Figure |3.2| the numerical performance 
of the DIAM, AM, pGN, and RW are compared by looking at their autocorrelation 
functions with underlying distributions tti through TTg for d = 100. The step-size /3 
is adapted by targeting the optimal acceptance ratio range, which is 0.1 to 0.3 for 
AM and RW and is 0.3 to 0.5 for DIAM and pCN. It is chosen initially as 2.4/y/d, 
which is suggested in [SHI US]. The top four panels of Figure 3.2 show the autocorre¬ 


lation function of ip(x) = log 7 ri(a:), for i = 1,2,4, and 5, as a single global measure 
of DIAM, AM, pCN and RW. The middle and bottom four panels of Figureshow 
the autocorrelation functions of ip{x) = vjx and (p{x) = vfx, the projections onto the 
eigenvector associated to the largest eigenvalue and the smallest eigenvalue, respec¬ 
tively. One expects that DIAM will perform the best and RW will perform the worst. 
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The performance of pCN and AM is subtle since, on the one hand, pCN is dimension- 
independent but isotropic algorithm and may become competitive in high-dimensional 
and well-conditioned cases. On the other hand, the AM algorithm performs equally 
in all directions, although suffers from a 0{d) dependence on the dimension, and 
therefore it performs better than pCN for targets of modest dimension whose covari¬ 
ance has a large condition number. The condition number is deliberately increased in 
target 714 from 0{d) (for cr^ = 1/d) to 0{d'^), so that one can have a more clear idea 
on how AM and pCN react as the condition number of C increases. This is shown in 
the bottom left panels, where there are two curves for each of pCN and AM. The AM 
algorithm performs the same. The pCN algorithm, on the other hand, performs the 
same for the eigen-direction corresponding to the smallest eigenvalue (bottom panels), 
but performs significantly worse on both other functionals. Numerical experiments 
confirm the behavior described above. 

As mentioned, it is expected that the cost of a forward solve, say C{d), will be 
bounded by 0(d?), so these experiments should give a good measure of the general 
usability of the algorithm. For example, if the forward solve involves a dense matrix- 
vector multiplication it is 0{d^), if it involves an iterative solution of a sparse system, 
it is 0{d), and if it involves fast Fourier transform (FFT), it is O{d\og{d)). The 
argument found in [3S], Section 3.1, indicates that the scaling of pCN is roughly 
0{Na////^C{d)), where is the smallest eigenvalue of the posterior covariance, 
with whitened prior (approximately the inverse of the largest eigenvalue of the prior 
pre-conditioned Hessian mm), at least in the case of Gaussian targets. In turn, 
AM is 0{NTaax{d^,C{d)}d), and DIAM is 0{Nmayi{d'^,C{d)}). One can therefore 
conclude that DIAM will outperform AM, and DIAM will outperform pCN if C{d) > 
d^cr^in. am will outperform pCN only if C{d) > 


3.3. Impact of niag choice. The outcome of any MCMC simulation depends, 
aside the natural variations due to random sampling, on the specific way the run 
is performed. First of all, the chain length must be sufficient, and the burn-in has 
to be dealt with properly. In addition, any algorithm contains a number of tuning 
parameters that may decisively affect the results, and the frequency we update our 
proposal, denoted by u-iag, is one of the parameters that needs to be tuned. 

Test cases with target tti were run separately at various values of d. For each 
d = 100, 200,..., 500, and 800, niag varied over {d/IOO, d/10, d/4, d/2, d, 2d, 4d, lOd}, 
and the program was run until a certain stopping criterion has been reached. The 
number of samples necessary to reach convergence, normalized by the number for 
ffiag = d, is shown in the left panel of Fig. 3.2 as a function of niag/d. It is interesting 
that in fact the number of necessary samples increases for small enough niag. The 
corresponding time to convergence (not shown) is large for either small or large niag, 
due to the increased number of 0((P) operations in the former case and the increased 
number of required samples in the latter. The curves are not convex, although this is 
presumably due to random effects and it is expected that they would smooth out if 
averages were taken over sufficiently many simulations. While it would be interesting 
to identify the optimal value of niag and see if it converges over multiple values of d, 
and even targets, to a universal value, for the present purposes this is not necessary. 
It suffices to observe that the minimum occurs for some niag = 0[d). The value of 
niag is chosen as d/2 in the experiments to follow. This means that the total cost of 
the algorithm is 0{d^N), where N is the total number of samples. Similar effect could 
be obtained by performing low-rank Cholesky updates, although Fig. |3.2| indicates 
this may actually lead to a larger number of necessary samples to convergence for 



14 


Accelerated DIAM 


TTi 


7r2 




400 600 

Lag 



TTi 



0 200 400 600 800 1000 

7r4 



0 200 400 600 800 1000 

Lag 


TTi 



0 200 400 600 800 1000 

7r4 


^2 


0.8 

0.6 

0.4 

0.2 


•A 

•A 

\s 

\ S 




200 


400 600 

TTS 


800 


1000 



7r2 






\\ 


\\ 

\\ 




0 200 400 600 800 1000 

Lag 



Figure 3.1. Comparison of autocorrelation function of the log posterior (top four panels), 
and the projection onto the eigenvector associated with the largest eigenvalue (middle four) and the 
smallest eigenvalue (bottom four) of DIAM, AM, pCN, and RW on targets tt^ for 2 = 1,2, 4, 5. 
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’^lag = 0{1), hence a larger cost. Furthermore, profiling with this choice shows 
that Level 3 BLAS operations take less than 10% of the total simulation time, a 
consequence of the fact that for sufficiently large d, the time to complete d Level 
2 BLAS operations of cost is significantly greater than one d^ operation, due to 
memory constraints. This is discussed more in the next section. 

For the examples illustrated here, convergence is diagnosed based on the exactly 
computable moments. In general, however, such ad-hoc techniques as the potential 
scale reduction factor (PSRF) described in Section 2.4.2 are required. The PSRF 
for TTi with d = 1000 is shown in the right-hand panel of Fig. |3.2| over various 
P, illustrating its convergence. The convergence criterion that is used to stop the 
chains is when the relative error of the sample covariance with respect to the truth 
in the Frobenius-norm falls below some TOL. The same convergence criterion, with 
TOL= 0.001, is used for all the runs except for the tuning of njag. For the latter, we 
use the weaker convergence criterion of the absolute error of the sample mean with 
respect to the truth in the Euclidean norm, with TOL=0.01. 



(a) Total required number of samples as a func¬ 
tion of niag/d (normalized by the number for 
niag = d), to satisfy a given convergence cri¬ 
terion. The missing points for small riiag and 
larger d correspond to a “max-time” criterion of 
12 hours. 



(b) PSRF convergence criterion for a range of 
number of chains P = 4,10,16, for d = 1000, 
with the number of outer batch iterations, k 
given on the x-axis. In this case, the chains are 
stopped when our convergence criterion is satis¬ 
fied. 


Figure 3.2. Tuning and convergence diagnostic. 


4. High performance implementation. In this section, we describe the high 
performance implementation of the DIAM algorithm using standard x86 and GPU- 
accelerated numerical libraries. 


4.1. Typical CPU-GPU Architecture Ecosystem. Today’s hardware land¬ 
scape is composed of lightweight x86 multicores associated with accelerators through 
a weak link called the Peripheral Component Interconnect Express (PCIe), as de¬ 
picted in Figure 4.1 The architectural discrepancies between the host (CPU) and 


the device (GPU) are manifest. GPU accelerators have thousands of CUBA cores, 
which provide unprecedented parallel performance and computing capabilities, i.e., 
more than an order of magnitude higher in terms of theoretical peak performance 
compared to the standard x86 CPU. Moreover, the speed to fetch data from GPU 
main memory is higher than the standard x86 CPU’s bandwidth, by a factor of two 
or more, depending on the CPU system specifications. In our testbed, it is almost a 
factor of five. However, the PCIe bus cannot transfer the data from the CPU mem¬ 
ory to the GPU memory as fast as the latter can compute. And this is precisely 
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where the challenge resides, in maintaining the CUBA cores always busy and not 
starving for computational work. This problem is further exacerbated by the limited 
size of the GPU memory, which can be smaller by one or two orders of magnitude, 
compared to the CPU memory. All in all, application performance can usually be 
leveraged using GPU technology (i.e., massive thread parallelism, high computing 
power and high memory bandwidth) as long as the overhead of moving data across 
the PCIe bus can be mitigated by using communication-reducing algorithms and/or 
mandatory communications can be overlapped by useful computations. 


Commodity 

Intel Xeon 
8 cores 3 GHz 
8*4 ops/cycle 
96 Gflop/s (DP) 


Accelerator (GPU) 

Nvidia K20X “Kepler” 
2688 “Cuda cores" 

.732 GHz 

2688*2/3 ops/cycle 
1.31 Tflop/s (DP) 


192 Cuda cores/SMX 




PCI-X 16 lane 
64 Gb/s (8 GB/s) 
1 GW/s 


Device Memory 


Figure 4.1. CPU-GPU hardware architecture m- 


4.2. High performance CPU-GPU numerical software stack. Fortunately, 
the high performance numerical software stack targeting the complexity of the CPU- 
GPU hardware is rich in kernel implementations and available from optimized open- 
source and vendor distributions. In particular, dense linear algebra (DLA) operations 
are well-supported on multicore and hardware accelerators, thanks to their regularity 
in terms of memory accesses. The fundamental DLA kernels are categorized in three 
levels: Level 1, 2 and 3, which form the basic linear algebra subroutines (BLAS) 
library. Level 1 BLAS involves vector-vector operations (e.g., dot product). Level 
2 BLAS corresponds to matrix-vector operations (e.g., matrix-vector multiplication) 
and Level 3 BLAS includes matrix-matrix operations (e.g., matrix-matrix multiplica¬ 
tion). While Level 1 and 2 BLAS operations are mostly memory-bound (limited by 
the bus bandwidth). Level 3 BLAS kernels are compute-bound thanks to a higher data 
reuse rate. BLAS kernels are often at the bottom of the software chain and, therefore, 
are critical for parallel performance. Vendors provide support for the BLAS kernels on 
their respective architectures. For instance, Intel provides its own high performance 
BLAS library on GPUs, distributed in the Math Kernel Library (MKL) [35]. On 
GPUs, NVIDIA provides the cuBLAS library |35j, which implements BLAS kernels 
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using the CUBA programming model [57]. The open-source KAUST BLAS (KBLAS) 
library [T] provides also a subset of Level 2 BLAS operations on GPUs, which per¬ 
forms better than the corresponding kernel from NVIDIA cuBLAS. Last but not least, 
LAPACK [2| provides CPU implementations of high-level DLA operations, such as 
solvers of linear equations and covariance (symmetric) matrix inversion. 

4.3. The DIAM software framework. Below is the work flow of DIAM Elfor 
sampling the target tt : —>■ [0, Ij. There is an evaluation of the log target at 


Algorithm 1 DIAM algorithm 

Initialize xq A(0,ld), Aq =Id, /3 = min{2.4/A/(i, 0.5}, n = 0,naccepted = 0; 


for convergence criterion > TOL do 

Propose: x* * = x^ef + V^l - /32(a;„ - x^ei) + 
u ~ Uniform(u; 0,1) 

logo = \ogTr{x*) + ^{A-^x*y {A-^x*) - (lognixn) + ^{A-^x^y {A-^x„)) 
if log u < log a then 

Accept the proposal. — X , ^accepted — ^accepted 1; 

else 

Reject the proposal: Xn+i = Xn 

end if 


if n = fcniag, k G Z then 

Calculate acceptance ratio a = naccepted/j^iag and update /3 (increase if 
a > Oinax or decrease if if d < Omin); ^accepted = 0; 

Calculate empirical mean and covariance mn,Cn as (2.15), (2.14); 
Update An = Cholesky(C'„); Compute 


Batch update , . . . , ^n+ni^g] — (^C^An [lUn-t-l ? • ■ • ? 5 

where Wm ~ A(0,ld) i.i.d.; 
end if 


n = n + 1', 

end for 


each iteration, which is a Level 2 BLAS operation for all of our random targets, and 
another Level 2 BLAS evaluation for the multiplication by in the evaluation of 
the weighted quadratic. Every niag iterations there is a Level 2 BLAS operation for 
evaluation of the mean, and Level 3 BLAS operations for evaluation of the second 
moment, Cholesky-based matrix inversion, and evaluation of the next niag random 
search directions. Nonetheless, the bottleneck with increasing dimension turns out 
to be the niag Level 2 BLAS operations in between updates, given that niag = 0{d) 
and the Level 2 BLAS operations are memory-bound. Notice Am^^+m = A^^^ for 
Txi < niag. Therefore, from this work flow, DIAM framework is basically composed by 
the following Level 2 and 3 BLAS operations: 

• LARNV: random matrix generation function (auxiliary LAPACK function). 

• TRMV: performs triangular matrix-vector operations (Level 2 BLAS). 

• SYMV: performs symmetric matrix-vector operation (Level 2 BLAS). 

• GEMV: performs general matrix-vector operations (Level 2 BLAS). 


^It should be noted that there are other empirical details which are omitted. For example, a 
transient number of initial iterates no are collected, with burn-in discarded, before the covariance is 
updated for the first time. 
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• SYR: performs the symmetric rank 1 operation (Level 2 BLAS). 

• GEMM: performs general matrix-matrix operations (Level 3 BLAS). 

• POTRF: performs Cholesky factorization (LAPACK function, mostly com¬ 
posed of Level 3 BLAS). 

• POTRI: computes the inverse of a real symmetric positive definite matrix A 
using the Cholesky factorization (POTRF) A = U'^U or A = LL^ (LAPACK 
function, mostly composed of Level 3 BLAS). 

All these functions are available from the high performance numerical CPU and GPU 
libraries, introduced in Section 14.2 


4.4. Single chain parallelization implementation challenges. The chal¬ 
lenge now resides in composing with all libraries and in determining which kernels 
need to run on which platform. Level 2 and 3 BLAS operations usually perform 
best on GPUs, i.e., the Cholesky-based symmetric matrix inversion of the sample 
covariance computation from Equation (2.14) and the dense matrix-vector multipli¬ 
cation, as highlighted in Equation (2.16). On the one hand, the Cholesky-based 


matrix inversion is compute-intensive and its complexity may impede performance 
scalability of the overall parallel DIAM approach, if frequently requested for solving 
high-dimension problems. On the other hand, the dense matrix-vector multiplication 
is memory-bound and, therefore, exhibits a lower arithmetic complexity and slows the 
parallel DIAM implementation down if it becomes predominant. The lag-time is then 
paramount to balance these two operations and to further reduce the time to solution, 
and it warrants further investigation. We rely on existing high-performance imple¬ 
mentations of both operations: we use the KBLAS [1] and the NVIDIA cuBLAS [58] 
libraries for the Level 2 BLAS operations on GPU occurring each iteration and the In¬ 
tel MKL library |55| to perform the Cholesky-based matrix inversion and other Level 
3 BLAS operations occurring once every niag iterations. This hybrid CPU-GPU im¬ 
plementation requires the data movement between CPU and GPU memory through 
the slow PCIe link. Ideally, one should try to operate on persistent data once on 
GPU memory to increase data reuse within the simulation. When this is not feasible, 
data motion has to be hidden using asynchronous data communication to mitigate 
the overhead of the slow PCIe bridge. The cuBLAS and KBLAS libraries provide API 
functionalities to ensure communication can be overlapped with computation, through 
the CUDA programming model using the function CUDA_MEMCPY_ASYNC. 

4.5. Concurrent chain parallelization using multithreading. The degree 
of parallelism of DIAM can be further leveraged by running concurrent chains (see 
Section 4.5). Thanks to the POSIX threads programming model (Pthreads), threads 


are instantiated and work in an embarrassingly parallel fashion. We rely on the usual 
fork and join parallel programming model to take advantage of the parallelism exposed 
by the concurrent chains. Once P threads are created, each thread p will have its 
own private memory containing all needed information to independently process, as 
depicted in Figure In Figure |4.2[ k denotes the number of batches which have 
been done. At the end of each batch processing, the threads are joined using a shared 
memory lock to facilitate and ensure safe synchronization. This may engender load 
imbalance if the workload per thread is not similar. However, this can be overcome 
using a more sophisticated dynamic scheduler to reduce the idle time [82] . 

This second level of parallelism introduces another complexity on the CPU be¬ 
cause it mixes threads created by the Intel MKL library (OpenMP) as well as the 
concurrent chains (Pthreads). Indeed, MKL implements multithreading in BLAS 
functions and the default number of threads MKL uses corresponds to the num- 
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Each chain run its own DIAM with length = 



Figure 4.2. DIAM using the fork and join parallel programming model. 


Specifications 


CPU 

2 

Cores/CPU 

10 

Clock frequency (GHz) 

2.8 

Gache size (MB) 

25 

Memory Bandwidth (GB/s) 

59.7 

Main Memory (GB) 

256 

PCI Express 

3.0 


Table 5.1 

Specifications for Intel Xeon Ivy 
Bridge E5-2680 v2. 


P 

Total 
time [s] 

Time per 
batch [s] 

PMATniag 

1 

42517.73 

5.98 

142260000 

2 

22779.92 

8.46 

107760000 

4 

9486.76 

8.78 

86480000 

6 

5878.11 

9.09 

77640000 

8 

4466.00 

9.67 

73920000 

10 

3506.19 

9.9 

70800000 

12 

3215.66 

11.01 

70080000 

14 

3024.47 

12.1 

70000000 

16 

2962.85 

13.47 

70400000 


Table 5.2 


Scaling to concurrent chains in terms of 
convergence time. Here the total number of 
samples is N = PMKni^^. 


ber of physical cores available on the system, except if the environment variable 
MKL_NUM_THREADS is defined by the user. Thus, the total number of threads 
running in the system is P x Pmki, where P is the number of chains launched and Pmki 
is the number of threads MKL functions fork. When P x Pmki is higher than the actu¬ 
ally number of cores (Pcores) the system has, the overall performance may drop down 
because of thread oversubscription. Therefore, it is critical to keep P x P^ki < Pcores- 

5. Performance results. This section presents the performance results of var¬ 
ious DIAM implementations. 

5.1. Environment settings. Table |5.1| defines the CPU specifications of the 
computing system used in these experiments. Sustained bandwidth is determined by 
the Stream benchmark. The total number of cores is 20. 

The system has three NVIDIA Tesla K40 GPU Accelerators with 1.4 TFLOPS 
sustained performance, 12 GB memory, and ultra-fast memory bandwidth 288 GB/s 
each. The machine runs Ubuntu 14.04.1 LTS and provides Intel Compilers Suite 
vl3.0 together with the MKL library. The DIAM code is written in C and relies on 
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OpenMP for MKL and Pthreads for the multiple chains implementation as well as 
CUBA through cuBLAS and KBLAS, for the CPU and GPU interfaces, respectively. 

5.2. Empirical tuning. One can notice that Level 3 BLAS functions in DIAM 
are only called every d/2 iterations, reducing the algorithm complexity to 0{(P). The 
strategy we use here is that when dealing with small problems, e.g., problem sizes 
smaller than 1000, the optimized Intel MKL [^, is preferred (only CPU), while when 
dealing with larger problems, e.g., problem sizes larger than 1000, high performance 
libraries such as cuBLAS [58] and KBLAS [T| are preferred (GPU). This tuning choice 
helps mitigate the overhead of copying data between the host (CPU) and the device 
(GPU). 

5.3. CPU-GPU performance profiling. Performance profiling of the MKL- 
based DIAM CPU implementation indicates that, as the dimension increases, SYMV 
becomes the bottleneck and impedes scaling to higher dimensions. SYMV is a Level 
2 BLAS function and, thus, is limited by the bus bandwidth. As described in Sec¬ 
tion accelerators provide several times higher bandwidth compared to standard 
x86 architecture and, therefore, memory-bound kernels can still be accelerated on 
such hardware. 


5.4. Performance scalability of DIAM. One of the approaches to statis¬ 
tical inference in high dimensions, beside algorithm improvement, is to reorganize 
the code into a faster implementation. In Figure 5.1 (a), we show performance 
scalability in seconds to collect 10® samples from d = 100 to d = 10000 using 
MKL sequential (by setting MKL_NUM_THREADS=1), MKL parallel (by setting 
MKL_NUM_THREADS=20) and MKL-KBLAS (hybrid) high performance libraries 
combined. The target distribution used here is tti. The MKL-KBLAS curve rep- 



fa) Performance scalability to collect le5 sam¬ 
ples. 



(b) Scalability of concurrent chains. 


Figure 5.1. Performance scalability. 


resents the implementation using both MKL and GPU-libraries. The MKL curve 
represents the implementation only using MKL and run with 20 threads by internally 
calling OpenMP, and MKL sequential represents the implementation written on C 
and run only with one thread with no parallel techniques involved. The time required 
to collect 10® samples of MKL-KBLAS code outperforms that of MKL parallel code 
for d > 3000. Fitting these three curves to quadratic functions results in the following: 

• MKL-KBLAS T = 56.39 - 0.036d -b 1.34 x 10-®^^ 

• MKL Parallel T = 7.49 - 0.033d -b 4.32 x IQ-^d'^ 

• MKL Sequential T = 253.63 - 0.3983d -b 1.65 x lO-'^d^ 

These functions make it easy to read the dquad such that quadratic scaling begins, as 
well as the asymptotic gain factor of between 3 — 4 in MKL-KBLAS as compared to 
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MKL alone, and similar between MKL parallel and serial. 

5.5. Performance scalability of concurrent chain DIAM. In Figure [5dt b), 
the scaling to concurrent chains is illustrated, for target tti with d = 1000, and 
M = AO fixed. The scaling is essentially T oc P~^ at first, but for P > 10 it slows 
down, on a machine with 20 cores (see discussion at the end of 4.5). This algorithm is 
memory-bound and needs synchronization after each chain generates a certain number 
of samples, thus, once the memory bandwidth is saturated, adding more threads will 
have limited benefit because more time is spent in each batch (the interval between 
each two synchronization, see Table 5.2). We refer to the results on Figure [5T| (b) 
“subtle” strong scaling because, in contrast to the traditional strong scaling, the 
problem size actually is shrinking, namely, the total number of samples required to 
get convergence is decreasing as we add more chains. This can therefore still be 
considered a form of strong scaling because the same convergence criterion is used, 
and in this sense the problem is the same. However, it is clear that the reduction 
in number of samples is converging. The reduction in required number of samples is 
likely due to the fact that more chains translates to more total samples used for a 
given update of the proposal covariance, hence the proposal adapts faster. 

These experiments performed on shared-memory systems suggest new opportuni¬ 
ties in further scaling DIAM to multiple distributed-memory nodes. As shown in this 
section, single-node performance starts to decay after running beyond one socket (i.e., 
ten cores in our testbed) due to the saturation of the bus bandwidth, which is typical 
for memory-bound applications. We can then weak-scale the simulation by adding 
more nodes, each equipped with GPUs, and solve higher dimensional problems on 
a distributed-memory environment using the Message Passing Interface (MPI) [5^ . 
The synchronization scheme described in Fig. |4.2| will have to be adjusted and explicit 
function calls will have to be made in order to handle communications across the com¬ 
putational nodes. In particular, collective communication operations will be required 
to synchronize between the distributed nodes. This may generate overheads due to 
the higher latency and lower bandwidth of the network interconnect when moving 
data off-chip. However, the latest MPI 3.0 standard allows for non-blocking collective 
communication operations, which may mitigate the overheads when running on large 
distributed-memory systems. 


6. Summary. A black-box MCMC algorithm is introduced for Bayesian infer¬ 
ence of highly anisotropic targets in high dimensions, herein named DIAM. In par¬ 
ticular, it is illustrated that for Gaussian target distributions the integrated autocor¬ 
relation time, and hence efhciency of the algorithm, is independent of the underlying 
dimension, asymptotically as the number of samples tends to infinity. The algorithm 
is illustrated to perform as expected on Gaussian targets, and also performs favor¬ 
ably with respect to standard AM on non-Gaussian targets. These algorithms are 
also compared to some other standard Metropolis variants. GPU-accelerated Level 2 
operations enable the efficient exploration of high-dimensional targets with d > 1000. 
The speedup versus standard serial G code is a factor of twelve as dimension tends to 
infinity. This improvement in conjunction with the combination of concurrent chains 
(justified a posteriori) may in principle allow exploration of very high-dimensional 
targets. A form of strong scaling with respect to convergence time is illustrated on 
up to 16 cores. The parallelization strategy used for DIAM algorithm will work also 
for the standard AM algorithm. 
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